Module - 4: Information Theory and Coding
On completion of this module, you will be able to:
Define information and discrete messages
Define amount of information
Explain Entropy and its properties
Define information rate
Explain mutual information
Explain properties of mutual information
Introduction to Information
- Information can take the form of electrical signals, words, digital data, pictures, music, etc.
- Communication systems are designed to transmit different types of information from a sender to a receiver through a medium.
- The nature of the information determines the transmission efficiency of the communication system.
- In radio broadcasting the information is audio i.e., analog in nature.
- In computers the outputs are discrete in nature. Hence these sources are called discrete sources.
- Example:
Fig. 1: Radio Broadcasting.
Uncertainty & Probability
- Uncertainty depends on our surprise at an event.
- For example, “The sun will rise tomorrow” is not surprising (P ≈ 1), but “The sun will not rise tomorrow” is very surprising (P << 1).
- So, uncertainty is inversely related to the probability of the event.
Uncertainty ∝ 1 / P(event)
Uncertainty = log(1 / P(event)) = −log P(event)
Introduction to Discrete Messages
- When a signal is quantized, the receiver needs to determine only the quantized level encountered at each sampling instant
- Discrete messages consist of those quantized levels, which can be transmitted directly or encoded into any one of a number of forms
- One computer sending a bit stream over a pulse-modulated line to another computer is an example of a discrete system
- A conversation between two people by means of sound waves is an example of a mixed system: the words in the message are discrete but the signal is continuous
Fig. 2: Discrete Sampled Signals.
Amount of Information
- Consider messages m1, m2, m3, m4, … transmitted by a communication system with probabilities of occurrence P1, P2, P3, P4, … respectively
- Then the amount of information carried by the message mk is given as,
Ik = log2 (1 / Pk)
- where, Pk = Probability of occurrence of mk
4.5 Unit of Information
- The unit of information is the bit when the base of the logarithm is 2.
- When the base is e, the unit of information is the nat.
- When the base is 10, the unit of information is the hartley (or decit).
Problem:
A message signal mk is transmitted by a transmitter. The probability of occurrence of this signal is 1/2. Calculate the information conveyed by it in bits, nats and decits.
Solution :
- Given that, Pk = 1/2
- Information in terms of bits ( log base 2 ).
Ik = log2 (1 / Pk)
Ik = log2 (2) = 1 bit
- Information in terms of nats ( log base e ).
Ik = loge (1 / Pk)
Ik = loge (2) = 0.693 nats
- Information in terms of decit ( log base 10 ).
Ik = log10 (1 / Pk)
Ik = log10 (2) = 0.301 decit
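These conversions can be checked numerically; the following is a minimal Python sketch (variable names are illustrative, not from the source).

```python
import math

p_k = 0.5  # probability of occurrence of the message m_k

# Amount of information in the three common units.
bits = math.log2(1 / p_k)          # base-2 logarithm -> bits
nats = math.log(1 / p_k)           # natural logarithm -> nats
hartleys = math.log10(1 / p_k)     # base-10 logarithm -> hartleys (decits)

print(f"{bits:.3f} bit, {nats:.3f} nat, {hartleys:.3f} hartley")
# Expected output: 1.000 bit, 0.693 nat, 0.301 hartley
```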
Introduction to Entropy
- The Entropy is defined as the average information per message. It is represented by the symbol H, and its units are bits/message
- The Entropy should be as high as possible in order to ensure maximum transfer of information .
- Consider M different and independent messages m1, m2, … with probabilities of occurrence P1, P2, …
- Suppose that during a long period of transmission a sequence of L messages is generated, with L very large
- Then, in the L-message sequence, P1·L messages of m1 are transmitted, P2·L messages of m2 are transmitted, etc.
- The total information in such a sequence will be.
Itotal = P1L log2(1/P1) + P2L log2(1/P2) + ..............
- The average information per message will be
H = (Itotal)/L = P1log2(1/P1) + P2 log2(1/P2) + ..............
H = ∑(k=1 to M) Pk·log2(1/Pk)
- This is the expression for Entropy
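The expression for H translates directly into a short helper; this is an illustrative Python sketch, not part of the source material.

```python
import math

def entropy(probs):
    """Average information per message, H = sum(Pk * log2(1/Pk)) in bits/message."""
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

# Two-message source with P = 1/4 and 3/4.
print(entropy([0.25, 0.75]))   # ~0.811 bits/message

# Equally likely messages give the maximum entropy log2(M).
print(entropy([0.25] * 4))     # 2.0 bits/message = log2(4)
```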
Entropy for different probability
- Let us consider two messages with probabilities P and (1−P). Then, the average information per message is
H = P·log2(1/P) + (1-P)·log2(1/(1-P))
- The maximum value of H occurs at P = 1/2, when the two messages are equally likely
- The corresponding H is
Hmax = (1/2).log2(2) + (1/2).log2(2)
Hmax = 1 bit/message
- So, when there are M messages, H becomes maximum when all the messages are equally likely. In this case each message has a probability p = 1/M and
Hmax = ∑ (1/M).log2M = log2M
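A quick numerical scan confirms that the binary entropy peaks at P = 1/2; the sketch below is illustrative only.

```python
import math

def binary_entropy(p):
    """H(p) = p*log2(1/p) + (1-p)*log2(1/(1-p)) in bits/message."""
    if p in (0.0, 1.0):
        return 0.0
    return p * math.log2(1 / p) + (1 - p) * math.log2(1 / (1 - p))

# Scan P from 0 to 1 and confirm the maximum sits at P = 1/2.
best_p = max((i / 100 for i in range(101)), key=binary_entropy)
print(best_p, binary_entropy(best_p))   # 0.5, 1.0 bit/message
```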
Property of Entropy
- Average information (Entropy) is zero if the event is certain or impossible i.e.,
H = 0 if Pk = 0 or 1
- When Pk = 1/M for all the M symbols, the symbols are equally likely. For such cases
Entropy is given as, H = log2M
- Upper bound on entropy is given as,
Hmax = log2M
Joint Entropy
- If two sources S1 and S2 deliver the messages xi and yj respectively, which are not necessarily mutually independent,
- Then the probability of their simultaneous occurrence is given by the joint probability P(xi, yj)
- The Entropy of the joint event (xi, yj) is known as the Joint Entropy and it is defined as,
H(X, Y) = ∑i ∑j P(xi, yj)·log2(1/P(xi, yj))
- If xi and yj are mutually independent, the joint probability factors as P(xi, yj) = P(xi)·P(yj), and the Joint Entropy becomes
H(X, Y) = H(X) + H(Y)
- where H(X) = Entropy of the source S1 and H(Y) = Entropy of the source S2
- Thus the Joint Entropy of mutually independent sources is equal to the sum of the individual entropies
Conditional Entropy
- The conditional Entropy H(Y/X) is defined as the Entropy of Y given that X has occurred
- The conditional Entropy is defined as,
H(Y/X) = ∑i ∑j P(xi, yj)·log2(1/P(yj/xi))
- If xi and yj are independent, then P(yj/xi) = P(yj)
- So, the expression for the conditional Entropy reduces to,
H(Y/X) = ∑j P(yj)·log2(1/P(yj)) = H(Y)
- Thus for independent messages xi and yj the conditional Entropy is equal to the marginal Entropy
Relationship between the conditional & joint Entropy
- The expression for the Joint Entropy can be expanded using P(xi, yj) = P(xi)·P(yj/xi):
H(X, Y) = ∑i ∑j P(xi, yj)·log2(1/P(xi)) + ∑i ∑j P(xi, yj)·log2(1/P(yj/xi))
H(X, Y) = H(X) + H(Y/X)
- Similarly, it can also be proved that
H(X, Y) = H(Y) + H(X/Y)
- These are the required relations between the conditional and Joint Entropies.
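The chain rule H(X, Y) = H(X) + H(Y/X) can be verified numerically for any joint distribution; the sketch below uses a small illustrative 2×2 joint probability matrix (not taken from the source).

```python
import math

def h(probs):
    """Entropy in bits of a list of probabilities."""
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

# Illustrative joint probability matrix P(x_i, y_j) (rows: x, columns: y).
P = [[0.25, 0.10],
     [0.15, 0.50]]

# Marginal distributions of X and Y.
Px = [sum(row) for row in P]
Py = [sum(P[i][j] for i in range(2)) for j in range(2)]

# Joint entropy H(X, Y).
H_xy = h([p for row in P for p in row])

# Conditional entropy H(Y/X) = sum_ij P(x_i, y_j) * log2(1 / P(y_j / x_i)).
H_y_given_x = sum(P[i][j] * math.log2(Px[i] / P[i][j])
                  for i in range(2) for j in range(2) if P[i][j] > 0)

# Verify the chain rule H(X, Y) = H(X) + H(Y/X).
print(round(H_xy, 6) == round(h(Px) + H_y_given_x, 6))   # True
```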
Rate of Information
- If the source generates messages at the rate r messages per second, then the information rate is defined as,
R = r. H
- Where, r = number of messages/sec, and
H = average information/message
- Units of information rate:
R = [r messages/second] × [H bits/message]
R = average information per second, expressed in bits/sec
Problem :
- An analog signal is bandlimited to 4 kHz and sampled at the Nyquist rate. The samples are quantized into 4 levels and each level represents one message.
The probabilities of occurrence of these levels are p1 = p4 = 1/8 and p2 = p3 = 3/8. Find the information rate of the source.
Solution :
- Given that fm = 4 kHz, therefore the sampling (message) rate is r = 2fm = 8000 messages/sec
- The average information per message is,
H = 2 × (1/8)·log2(8) + 2 × (3/8)·log2(8/3) = 0.75 + 1.06 ≈ 1.811 bits/message
- So, the information rate is,
R = r × H
= 8000 × 1.811
≈ 14,490 bits/sec
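The calculation above can be reproduced with a few lines of Python; this is an illustrative check only.

```python
import math

# Quantization-level probabilities from the problem.
probs = [1/8, 3/8, 3/8, 1/8]

# Average information per message (per sample).
H = sum(p * math.log2(1 / p) for p in probs)   # ~1.81 bits/message

# Messages per second = Nyquist sampling rate = 2 * 4 kHz = 8000 samples/s.
r = 8000

print(H, r * H)   # ~1.81 bits/message, ~14,490 bits/s
```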
Mutual Information
- Mutual information is defined as,
I(x; y) = log2[P(x/y) / P(x)]      …(1)
- Where, P(x/y) = the conditional probability of x given that y occurs, and P(x) = the probability of x
- Mutual information measures the amount of information transferred when x is transmitted and y is received
For an ideal noiseless channel
- P(x/y) = 1
So, from equation (1),
I(x; y) = log2[1/P(x)]
- So, the mutual information is the same as the self-information of x
If the channel noise is so large that y is totally unrelated to x, then
- P(x/y) = P(x)
So, from equation (1),
I(x; y) = log2(1) = 0
- That means no information is transferred
Average mutual information
- Average mutual information is defined as,
I(X; Y) = ∑x ∑y P(x, y)·log2[P(x/y) / P(x)]
- I(X; Y) represents the average source information gained per received symbol
Expression for mutual information
- Expanding the average mutual information,
I(X; Y) = ∑x ∑y P(x, y)·log2(1/P(x)) − ∑x ∑y P(x, y)·log2(1/P(x/y))
I(X; Y) = H(X) − H(X/Y)
- Thus the mutual information is equal to the difference between the source Entropy H(X) and the noise Entropy H(X/Y)
- If we interchange X and Y then,
I(X; Y) = H(Y) − H(Y/X)
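The equivalence between the defining sum and H(X) − H(X/Y) can be checked numerically; the joint distribution below is an illustrative example, not from the source.

```python
import math

def h(probs):
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

# Illustrative joint distribution P(x_i, y_j) for a noisy channel.
P = [[0.40, 0.10],
     [0.10, 0.40]]

Px = [sum(row) for row in P]
Py = [sum(P[i][j] for i in range(2)) for j in range(2)]

# Definition: I(X;Y) = sum_ij P(x,y) * log2( P(x,y) / (P(x) P(y)) ).
I_def = sum(P[i][j] * math.log2(P[i][j] / (Px[i] * Py[j]))
            for i in range(2) for j in range(2) if P[i][j] > 0)

# Equivalent form: I(X;Y) = H(X) - H(X/Y).
H_x_given_y = sum(P[i][j] * math.log2(Py[j] / P[i][j])
                  for i in range(2) for j in range(2) if P[i][j] > 0)
I_alt = h(Px) - H_x_given_y

print(round(I_def, 6) == round(I_alt, 6))   # True
```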
Source Coding
- Coding theory is the study of the properties of codes and their fitness for specific applications. Codes are used for data compression, cryptography, error correction and, more recently, also for network coding.
- There are two classes of codes -
1. Source coding (Data compression)
2. Channel coding (Forward error correction).
- Source coding is the process of encoding information using fewer bits than the uncoded version of the information.
- The knowledge of the statistics of the source is required in order to develop an efficient source code.
- In particular, short codewords can be assigned to frequent source symbols and long codewords to rare source symbols. Such a source code is called a variable-length code.
- For example: Zip data compression is used every day on the Internet to reduce the network load and make files smaller.
Channel Coding
- Channel coding refers to the class of signal transformation designed to improve communication performance.
- It enables the transmitted signals to better withstand the effects of various channel impairments.
- The aim of channel coding theory is to find codes which transmit quickly, contain many valid code words and can correct or at least detect many errors.
Discrete Memoryless Channel
- A discrete memoryless channel (DMC) is a statistical model with an input of X and output of Y, which is a noisy version of X (here both are random variables).
- In each time slot the channel accepts an input symbol X selected from a given alphabet and emits an output symbol Y.
- The channel is said to be discrete when both of the alphabets X and Y have finite sizes.
- It is said to be memoryless when the current output symbol depends only on the current input symbol and not on inputs before or after.
Code efficiency
- Consider a discrete memoryless source with Entropy H bits/message whose messages are encoded with an average of N binits per message.
- The code efficiency is then defined as η = H / N; it approaches 1 for an efficient code.
Forward Transition Matrix
- Consider a discrete memoryless channel with input alphabet X and output alphabet Y.
- A convenient way of describing a DMC is to arrange the transition probabilities of the channel in the form of a matrix.
- The (j, k) entry of the matrix is P(Y = yk | X = xj), which is called the forward transition probability.
- Each row of the channel matrix P corresponds to a fixed channel input, whereas each column of the matrix corresponds to a fixed channel output.
Binary symmetric channel (BSC)
- It is a special case of DMC with J = K = 2. The channel has two input symbols (x0=0, x1=1)and two output symbols (y0=0, y1=1).
- The channel is symmetric because the probability of receiving a 1 if a 0 is sent is the same as the probability of receiving a 0 if a 1 is sent.
Figure : Transition probability diagram of binary symmetric channel
- The conditional probability of error is denoted by p and the transition probability matrix is given by,
P = | 1−p    p  |
    |  p    1−p |
Channel Capacity
- Channel capacity, C, is defined as ‘the maximum mutual information I(X; Y) in any single use of the channel, where the maximization is over all possible input probability distributions’.
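For the BSC above, the capacity can be found by maximizing I(X; Y) over the input distribution; the following sketch does this by brute force and compares the result with the closed form C = 1 − H(p) (the crossover probability p = 0.1 is an assumed illustrative value).

```python
import math

def h(probs):
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

p = 0.1                       # BSC crossover (error) probability
T = [[1 - p, p],              # forward transition matrix P(y | x)
     [p, 1 - p]]

def mutual_information(q):
    """I(X;Y) for input distribution P(X=1) = q over the BSC above."""
    Px = [1 - q, q]
    # Joint distribution P(x, y) = P(x) * P(y | x).
    P = [[Px[i] * T[i][j] for j in range(2)] for i in range(2)]
    Py = [P[0][j] + P[1][j] for j in range(2)]
    return h(Px) + h(Py) - h([v for row in P for v in row])

# Capacity = maximum of I(X;Y) over all input distributions.
C = max(mutual_information(i / 1000) for i in range(1001))
print(C, 1 - h([p, 1 - p]))   # both ~0.531 bits/use; the maximum is at q = 0.5
```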
Shannon-Fano Algorithm
- If the messages are not equally likely, then H < N, i.e. (H/N) < 1, and each binit of a fixed-length code carries less than 1 bit of average information.
- This type of problem can be solved by assigning a code of different length to each message
- The number of bits in a codeword decreases as the message becomes more likely
- The Shannon-Fano algorithm uses this technique to get an efficient code
- Consider table given below which gives five possible messages a, b, c, d and e with their corresponding probability
Message | a | b | c | d | e |
Probability | 1/15 | 1/7 | 1/6 | 1/6 | 1/5 |
Following procedure is given below to obtain the code words for each message
- List the source symbols (messages) in the order of decreasing probability.
- Partition the set in to two sets that are as close to equiprobable as possible, and assign 0 to the upper set and 1 to the lower set.
- Continue this process, each time partitioning the sets with as nearly equal probabilities as possible, until further partitioning is not possible (a coded sketch of this procedure follows the problem below).
Fig. : Flow Chart
Problem :
- Find the codeword for each message in the table above. Also calculate the code efficiency.
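A minimal sketch of the Shannon-Fano procedure applied to the table above is given below; the function names and the tie-breaking rule for the split are illustrative choices, and the probabilities are renormalised for the entropy calculation because the table values do not sum exactly to one.

```python
import math

def shannon_fano(symbols):
    """symbols: list of (name, probability) in decreasing order of probability;
    returns {name: codeword}. A sketch of the recursive partitioning above."""
    if len(symbols) == 1:
        return {symbols[0][0]: ""}
    # Find the split that makes the two sets as close to equiprobable as possible.
    total = sum(p for _, p in symbols)
    best_i, best_diff, running = 1, float("inf"), 0.0
    for i in range(1, len(symbols)):
        running += symbols[i - 1][1]
        diff = abs(running - (total - running))
        if diff < best_diff:
            best_i, best_diff = i, diff
    upper, lower = symbols[:best_i], symbols[best_i:]
    codes = {}
    for prefix, group in (("0", upper), ("1", lower)):
        for name, code in shannon_fano(group).items():
            codes[name] = prefix + code
    return codes

# Messages listed in order of decreasing probability (from the table above).
msgs = [("e", 1/5), ("c", 1/6), ("d", 1/6), ("b", 1/7), ("a", 1/15)]
codes = shannon_fano(msgs)
print(codes)

# Code efficiency = H / average codeword length, with the probabilities
# renormalised so that they sum to one.
total = sum(p for _, p in msgs)
probs = {name: p / total for name, p in msgs}
H = sum(p * math.log2(1 / p) for p in probs.values())
L = sum(probs[name] * len(code) for name, code in codes.items())
print(f"H = {H:.3f} bits, L = {L:.3f} bits, efficiency = {H / L:.1%}")
```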
Prefix Coding
- A prefix code is a variable-length source coding scheme in which no codeword is the prefix of any other codeword.
- The prefix code is a uniquely decodable code.
- But, the converse is not true i.e., all uniquely decodable codes may not be prefix codes.
Symbol | Prob. of Occurrence | Code I | Code II | Code III |
S0 | 0.5 | 0 | 0 | 0 |
S1 | 0.25 | 1 | 10 | 01 |
S2 | 0.125 | 00 | 110 | 011 |
S3 | 0.125 | 11 | 111 | 0111 |
- The definition of prefix code is illustrated above in the table.
- From the table, code I is not a prefix code. Code II is a prefix code. Code III is also uniquely decodable but not a prefix code.
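The prefix property is easy to test mechanically; the sketch below checks the three codes from the table (illustrative helper, not from the source).

```python
def is_prefix_code(codes):
    """True if no codeword is a prefix of another (a sufficient condition for
    unique decodability)."""
    return not any(a != b and b.startswith(a) for a in codes for b in codes)

code_I   = ["0", "1", "00", "11"]
code_II  = ["0", "10", "110", "111"]
code_III = ["0", "01", "011", "0111"]

print(is_prefix_code(code_I), is_prefix_code(code_II), is_prefix_code(code_III))
# False True False -- matching the discussion above
```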
Huffman Coding
- Huffman developed a nice greedy algorithm for producing a minimum- cost (optimum) prefix code
- The purpose of the Huffman code is to construct minimum-redundancy codes
Procedure
- List the source symbols in order of decreasing probability
- The two source symbols of lowest probability are assigned 0 and 1
- These two source symbols are combined into a new message
- The probability of this new message is equal to the sum of the probabilities of the two original symbols
- The probability of this new message is placed in the list according to its value
- Follow this procedure until only two messages remain and assign 0 and 1 to them (a coded sketch follows the worked example below)
Applications
- Both the .mp3 and .jpg file formats use Huffman coding at one stage of the compression
- An alternative method that achieves higher compression but is slower was patented by IBM, making Huffman codes attractive
- File compression: JPEG images, MPEG movies
- Transmission of data over band-limited channels: Modem data compression
- Video cards, sound cards
Example : A discrete memoryless source has five symbols with the respective probabilities given below. Find the Huffman code
Ax = { a, b, c, d, e }
Px = { 0.25, 0.25, 0.2, 0.15, 0.15 }
Solution:
Message | Probability | Code | Code Length |
a | 0.25 | 00 | 2 |
b | 0.25 | 10 | 2 |
c | 0.2 | 11 | 2 |
d | 0.15 | 010 | 3 |
e | 0.15 | 011 | 3 |
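A compact way to build such a code is a heap-based greedy merge; the sketch below is illustrative, and because ties between equal probabilities can be broken either way, the codewords may differ from the table above while the code lengths (and hence the average length of 2.3 bits/symbol) stay the same.

```python
import heapq
import itertools

def huffman(prob_map):
    """Build a binary Huffman code for {symbol: probability}; returns
    {symbol: codeword}."""
    counter = itertools.count()             # tie-breaker so heapq never compares dicts
    heap = [(p, next(counter), {s: ""}) for s, p in prob_map.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p0, _, c0 = heapq.heappop(heap)     # two least probable nodes
        p1, _, c1 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c0.items()}
        merged.update({s: "1" + c for s, c in c1.items()})
        heapq.heappush(heap, (p0 + p1, next(counter), merged))
    return heap[0][2]

probs = {"a": 0.25, "b": 0.25, "c": 0.2, "d": 0.15, "e": 0.15}
codes = huffman(probs)
avg_len = sum(probs[s] * len(c) for s, c in codes.items())
print(codes, avg_len)   # average length 2.3 bits/symbol, as in the table
```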
Shannon's theorem on channel capacity
- A given communication system has a maximum rate of information C known as the channel capacity
- Theorem: Given a source of M equally likely messages, with M >> 1, which is generating information at a rate R, and given a channel with channel capacity C, then if
R ≤ C
there exists a coding technique such that the output of the source may be transmitted over the channel with a probability of error in the received message that may be made arbitrarily small
Shannon-hartley theorem
- This theorem is complementary to Shannon's theorem. It applies to a channel in which the noise is Gaussian
- Theorem: The channel capacity of a white, bandlimited Gaussian channel is given by,
C = B·log2(1 + S/N) bits/sec
- where B is the channel bandwidth(BW), S is the signal power and N is the total noise within the channel BW
Trade-off between bandwidth and signal to noise ratio (SNR)
- By Shannon-Hartley theorem channel capacity of the Gaussian channel is given as
C = B·log2(1 + S/N) bits/sec
- From the above equation channel capacity depends on two factors
1. Bandwidth(B) of the channel
2. Signal to noise ratio (S/N)
A noiseless channel has infinite capacity
- If the noise power N → 0, then S/N → ∞ and the channel capacity C → ∞
Infinite BW does not give infinite capacity
- If the BW is infinite, the channel capacity remains finite, because as the BW increases the noise power also increases
- The noise power is given as N = N0·B, where N0 is the noise power spectral density
- Hence C = B·log2(1 + S/(N0·B)), and as B → ∞ the capacity approaches the finite limit
C∞ = (S/N0)·log2 e ≈ 1.44·(S/N0)
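The saturation of capacity with bandwidth can be seen numerically; the signal power and noise density below are assumed illustrative values, not from the source.

```python
import math

S = 1e-4        # signal power in watts (illustrative value)
N0 = 1e-9       # one-sided noise power spectral density in W/Hz (illustrative)

def capacity(B):
    """Shannon-Hartley capacity with noise power N = N0 * B."""
    return B * math.log2(1 + S / (N0 * B))

for B in (1e3, 1e4, 1e5, 1e6, 1e7):
    print(f"B = {B:.0e} Hz  ->  C = {capacity(B):.0f} bits/s")

# Limiting value as B -> infinity: C_inf = 1.44 * S / N0.
print(f"C_inf = {1.44 * S / N0:.0f} bits/s")
```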
Example :
- A Gaussian channel has a 1 MHz bandwidth. Calculate the channel capacity if the signal power to noise spectral density ratio S/N0 is 10^5 Hz. Also find the maximum information rate.
Solution :
- S/N = (S/N0)/B = 10^5/10^6 = 0.1
- C = B·log2(1 + S/N) = 10^6 × log2(1.1) ≈ 1.375 × 10^5 bits/sec
- The maximum information rate (B → ∞) is C∞ = 1.44·(S/N0) = 1.44 × 10^5 bits/sec
Example :
- Consider an AWGN channel with 4 kHz bandwidth and noise power spectral density N0/2 = 10^-12 W/Hz. The signal power required at the receiver is 0.1 mW. Calculate the capacity of this channel.
Solution :
- N = N0·B = 2 × 10^-12 × 4000 = 8 × 10^-9 W, so S/N = 10^-4 / (8 × 10^-9) = 1.25 × 10^4
- C = B·log2(1 + S/N) = 4000 × log2(1 + 12500) ≈ 5.44 × 10^4 bits/sec
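Both worked examples can be checked with a few lines of Python (illustrative sketch only).

```python
import math

# Example 1: B = 1 MHz, S/N0 = 1e5 Hz.
B1 = 1e6
snr1 = 1e5 / B1                          # S/N = (S/N0) / B
C1 = B1 * math.log2(1 + snr1)            # ~1.375e5 bits/s
C1_max = 1.44 * 1e5                      # B -> infinity limit, 1.44e5 bits/s

# Example 2: B = 4 kHz, N0/2 = 1e-12 W/Hz, S = 0.1 mW.
B2, N0, S = 4e3, 2e-12, 1e-4
C2 = B2 * math.log2(1 + S / (N0 * B2))   # ~5.44e4 bits/s

print(f"{C1:.4g}  {C1_max:.4g}  {C2:.4g}")
```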
Error Detection & Error Correction Codes
Need for error control coding
- In telecommunications, the detection and correction of errors is important for maintaining data integrity on noisy communication channels
- When digital signals (groups of 0s and 1s) are transmitted from one system to another, errors may occur during transmission
- The change of a 1 to a 0, or of a 0 to a 1, during transmission is known as an error
- It is necessary to detect and correct the error to obtain a correct message
- Error detection techniques detect the error while error correction technique reconstructs the original data.
- To detect or correct errors, need to send extra (redundant) bits with data.
- Error-correcting codes enable the receiver, acting alone, both to detect the presence of an error in the received data and to reconstruct the data in its original form
- Error correcting codes are divided in to two types
1. Block codes
2. Convolutional codes
- The major difference between these two classes is that block codes do not need memory, while convolutional codes require memory
Linear Block Code
- Linear block codes are so named because each code word in the set is a linear combination of a set of generator code words
- Block codes are codes in which k bits are input, called data words (messages), and n bits are output, called code words
- To generate an (n, k) block code, the channel encoder accepts information in successive k-bit blocks
- For each block, (n−k) redundant bits are added to produce an encoded block of n bits, called a codeword
Fig. : Block Code.
- Linear block codes are very easy to implement in hardware, and since they are algebraically determined, they can be decoded in constant time.
- Block code is linear if and only if the modulo-2 sum of two code words is also a code word
- Sum, difference and scalar multiples of codeword are also a codeword
Encoding of (n, k) linear block code
- In the encoding process the codeword is generated by multiplying the data word by the generator matrix (modulo-2): C = D·G
Systematic Block Code (n, k)
- For a systematic code, the first (or last) k elements in the codeword are the information bits; the generator matrix then has the form G = [Ik | P], where P is the parity submatrix
Decoding of (n, k) Linear Block Code
- Received bits (R) = transmitted codeword (C) + error pattern (e)
- Compute the syndrome vector S = R·H^T (modulo-2), where H is the parity-check matrix; S depends only on the error pattern e and locates the position of the error in the received bits
- Complement the errored bit and extract the estimated message from the corrected codeword
Fig. : Error Detection in Block Coding.
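A minimal sketch of systematic encoding and syndrome decoding is given below for a (7, 4) code; the particular generator and parity-check matrices are illustrative choices (any pair satisfying G·H^T = 0 works the same way) and are not taken from the source.

```python
G = [  # generator matrix [I_4 | P], codeword c = d.G (mod 2)
    [1, 0, 0, 0, 1, 1, 0],
    [0, 1, 0, 0, 1, 0, 1],
    [0, 0, 1, 0, 0, 1, 1],
    [0, 0, 0, 1, 1, 1, 1],
]
H = [  # parity-check matrix [P^T | I_3], used to compute the syndrome
    [1, 1, 0, 1, 1, 0, 0],
    [1, 0, 1, 1, 0, 1, 0],
    [0, 1, 1, 1, 0, 0, 1],
]

def encode(d):
    """Codeword c = d.G mod 2; the first k bits are the message (systematic)."""
    return [sum(d[i] * G[i][j] for i in range(4)) % 2 for j in range(7)]

def syndrome(r):
    """Syndrome s = r.H^T mod 2; all-zero means no detectable error."""
    return tuple(sum(r[j] * H[i][j] for j in range(7)) % 2 for i in range(3))

# Syndrome table: each single-bit error pattern has a unique syndrome.
table = {syndrome([1 if j == pos else 0 for j in range(7)]): pos for pos in range(7)}

d = [1, 0, 1, 1]
c = encode(d)
r = c[:]
r[5] ^= 1                        # introduce a single-bit error in position 5
s = syndrome(r)
if s != (0, 0, 0):
    r[table[s]] ^= 1             # complement the errored bit
print(r == c, r[:4] == d)        # True True: codeword and message recovered
```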
Linear block codes applications
- Linear codes are applied in methods of transmitting symbols (e.g., bits) on a communications channel
- If errors occur in the communication, some errors can be detected by the recipient of a message block.
- Linear block codes are used in many cryptosystems
- It is very useful in situations where the BER of the channel is relatively low and bandwidth availability is limited.
- It is used for high-speed computer memory
1. In high speed memory, bandwidth is limited because the cost per bit is relatively high compared to low-speed memory like disks
2. The error rates are usually low and errors tend to occur by the byte, so a SEC/DED (single-error-correcting/double-error-detecting) coding scheme for each byte provides sufficient error protection
3. Error coding must be fast in this situation because high throughput is desired
Hamming code
Hamming distance
- The Hamming distance between two code words is defined as the number of locations in which their respective elements differ.
- For example, the code words 1 0 1 1 0 and 1 1 0 1 0 differ in two positions, so the Hamming distance between them is 2.
Hamming weight
- The Hamming weight of a code word is defined as the number of nonzero elements in the code word.
- For example: Consider the codeword c =1 1 0 1 0 1 0 0 the number of non zero elements is 4 hence the hamming weight is 4.
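Both quantities are one-liners in code; the example code words below are illustrative.

```python
def hamming_distance(c1, c2):
    """Number of positions in which the two (equal-length) code words differ."""
    return sum(a != b for a, b in zip(c1, c2))

def hamming_weight(c):
    """Number of nonzero elements in the code word."""
    return sum(bit != 0 for bit in c)

print(hamming_distance([1, 0, 1, 1, 0], [1, 1, 0, 1, 0]))      # 2
print(hamming_weight([1, 1, 0, 1, 0, 1, 0, 0]))                # 4, as above
```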
Convolution Code
- Convolutional codes were introduced in 1955 by Elias as an alternative to Block codes. Convolutional codes are different from block codes by the existence of memory in the encoding scheme.
- Convolutional coding is one of the most frequently used error correction techniques in digital communication. These codes can be used for correcting random errors, burst errors or both. Convolutional codes are also known as recurrent codes.
- In convolutional codes the message bit stream is encoded in a continuous manner rather than in blocks as in block codes.
- In convolutional codes, each block of k bits is mapped into a block of n bits.
- These n bits are not only determined by the present k message bits but also by the previous message bits.
- It works like a state machine, where output is determined by the input and the state of the machine.
- They are generated using shift registers.
- An (n, k, m) convolution code has
k --> number of input bits
n --> number of output bits
m --> number of memory elements (flip flops) in register
Code Rate
- The quantity R is called the code rate of the encoder and is defined by
R = k / n
- Example 1:
➢ Number of message bits k = 1
➢ Number of encoded bits n = 2
➢ Rate ½ means for each one input bit, the encoder provides 2 output bits. The encoder operates on one bit at a time.
➢ Suppose input sequence = 10110, then total number of encoded output bits = 5 x 2 = 10.
- Example 2:
➢ Number of message bits k = 1
➢ Number of encoded bits n = 3
➢ Rate 1/3 means for each one input bit, the encoder provides 3 output bits. The encoder operates on one bit at a time.
➢ Suppose input sequence = 10110, then total number of encoded output bits = 5 x 3 = 15.
- Example 3:
➢ Number of message bits k = 2 (two input bits are processed at a time)
➢ Number of encoded bits n = 3 (each group of 2 inputs are transformed into 3 bits)
➢ Rate 2/3 means for each group of 2 input bits, the encoder provides 3 output bits. The encoder operates on 2 bits at a time.
➢ Suppose input sequence = 1011, then total number of encoded output bits = 3 + 3 = 6.
Constraint Length
- The constraint length is defined as the number of shifts over which a single message bit can influence the encoder output
- The quantity L is called the constraint length of the code and is defined by
Constraint length, L = n (m+1)
where, m = number of memory registers
n = number of encoded bits per message bits
- It is measured in terms of encoded output bits
Code Dimension
- The dimension of the code depends on n, k and m
where n represents the number of encoded bits per message bit, k is the number of message bits taken at a time by the encoder and m is the encoder's memory
- The code dimension is therefore represented by (n, k, m )
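As an illustration of these definitions, the sketch below implements a rate-1/2 (n = 2, k = 1, m = 2) convolutional encoder with a shift register; the generator connections (1 + D + D², 1 + D²) are an assumed common choice, not taken from the source.

```python
def conv_encode(bits):
    """Encode a bit list; each input bit produces n = 2 output bits that depend
    on the current bit and the m = 2 previous bits held in the register."""
    s1 = s2 = 0                       # the two memory elements (flip-flops)
    out = []
    for b in bits:
        out.append(b ^ s1 ^ s2)       # first output: 1 + D + D^2 connection
        out.append(b ^ s2)            # second output: 1 + D^2 connection
        s1, s2 = b, s1                # shift the register
    return out

msg = [1, 0, 1, 1, 0]
print(conv_encode(msg), len(conv_encode(msg)))   # 5 input bits -> 10 output bits
```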
Advantages of convolution codes
- The decoding delay of these codes is short, since convolutional codes operate on smaller blocks of data
- Convolutional codes require a smaller amount of storage
- Loss of synchronization is not as serious a problem as it is in block-coded systems
Disadvantages of convolution codes
- Complex implementation
- Their theory and analysis are less developed than those of block codes
- Analyzing the performance is very difficult
Applications of convolution codes
- Widely used in satellite communications
- It is used in digital speech (the required BER is 10^-5)
- It is used in digital cellular mobile communications and universal mobile telecommunication systems in order to achieve coding gain at moderate code rates
- It is used in GSM systems